An Arabic-Hebrew parallel corpus of TED talks

Author

  • Mauro Cettolo
Abstract

We describe an Arabic-Hebrew parallel corpus of TED talks built upon WIT3, the Web inventory that repurposes the original content of the TED website in a form more convenient for MT researchers. The benchmark consists of about 2,000 talks, whose subtitles in Arabic and Hebrew have been accurately aligned and rearranged into sentences, for a total of about 3.5M tokens per language. Talks have been partitioned into training, development and test sets in the same way as in the MT tasks of the IWSLT 2016 evaluation campaign. In addition to describing the benchmark, we list the problems encountered in preparing it and the novel methods designed to solve them. Baseline MT results and some measures of sentence length are provided as an extrinsic evaluation of the quality of the benchmark.

Similar resources

The AMARA Corpus: Building Resources for Translating the Web’s Educational Content

In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles, and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to align t...

Automatic Speech Recognition and Machine Translation System for MIT English Lectures Using MIT and TED Corpus

This paper presents our attempt to create an English automatic speech recognition (ASR) system and an English-to-Japanese machine translation (MT) system. We utilized the existing Wall Street Journal corpus for our acoustic model and adapted it with MIT OpenCourseWare lectures, while the transcriptions of the MIT lectures are utilized to create the needed language model. For the parallel corpus of our stat...

Large-Scale Machine Translation between Arabic and Hebrew: Available Corpora and Initial Results

Machine translation between Arabic and Hebrew has so far been limited by a lack of parallel corpora, despite the political and cultural importance of this language pair. Previous work relied on manually-crafted grammars or pivoting via English, both of which are unsatisfactory for building a scalable and accurate MT system. In this work, we compare standard phrase-based and neural systems on Ar...

Machine Translation between Hebrew and Arabic: Needs, Challenges and Preliminary Solutions

Modern Hebrew and Modern Standard Arabic, both Semitic languages, share many orthographic, lexical, morphological, syntactic and semantic similarities, but they are still not mutually comprehensible. Most native Hebrew speakers in Israel do not speak Arabic, and the vast majority of Arabs (outside Israel) do not speak Hebrew. Machine translation (MT) between these two languages has the potential...

Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks

In this paper, we present improvements made to the TED-LIUM corpus we released in 2012. These enhancements fall into two categories. First, we describe how we filtered publicly available monolingual data and used it to estimate well-suited language models (LMs), using open-source tools. Then, we describe the process of selection we applied to new acoustic data from TED talks, providing addition...

Journal title:
  • CoRR

Volume abs/1610.00572

Publication date 2016